Documentation re-structure by githubnemo · Pull Request #3300 · huggingface/peft

githubnemo · 2026-06-03T21:11:17Z

The current state of the PEFT docs is not one of structure and I was constantly annoyed that whenever I wanted to change something there were several places that needed touching and they all felt disconnected. So this is my attempt at structuring the docs. Some of these ideas are quite old (discussed in 01/2025) but are still valid.

I've removed most of the code guides without replacement. That's not ideal: I think we should have code examples, but I think they should be method-focused. Maybe one general example of a training workflow is sufficient because most methods follow the same scheme. I'd appreciate some feedback on this.

All details from the method guides (prompting, lora, oft/boft, etc.) are now integrated into the respective method pages instead. I would have hesitated to do this if these guides had integrated information about the adapters, but they didn't. I think it makes a lot more sense to have one place for each method to gather examples/tips/recommendations and that is now the package_reference/<method> page. This page now also hosts a small space that shows the MetaMathQA (and potentially other) benchmark results highlighted for that method.

I've moved the LoRA initializations to package_reference/lora#Initialization and converted the init methods to <hfoption>-tags. This collapses them to a list but may reduce searchability through the document - at least Firefox is not able to search 'through' the option tabs. This also means they don't appear in the ToC, and people specifically searching for, say, PiSSA won't find it directly. I think that's OK though, since the search is able to locate it.

The quicktour is a bit more detailed about what happens under the hood (quick doesn't have to mean simplistic) and includes some new visualizations. I hope that we can integrate more visualizations in the future where it makes sense.

BenjaminBossan · 2026-06-04T11:01:45Z

Thanks a lot for revamping the PEFT docs, which I agree are not very user friendly at the moment. Could you please resolve the two merge conflicts so that preview docs could be rendered? I think it makes more sense to review the docs as a whole than going through the diff (which is probably showing a lot of text that has just moved places).

One concern that I have is that links to the PEFT docs could break with the new structure. Thus I have two questions:

1. Did you update doc links we may have in PEFT to ensure that they'll be up to date?
2. How do we deal with external links? It could be e.g. other repos (say, Axolotl, Hermes skill, etc.) but could also concern HF repos (e.g. links from PEFT or Transformers issues).

The space was not that useful anymore since most methods are compatible with most models. The front page buttons are, at least temporarily, with the exception of the quicktour and method overview buttons. I like the visuals but there should only be elements that are useful.

HuggingFaceDocBuilderDev · 2026-06-05T14:08:03Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

githubnemo · 2026-06-05T17:42:19Z

> One concern that I have is that links to the PEFT docs could break with the new structure. Thus I have two questions:
> 1. Did you update doc links we may have in PEFT to ensure that they'll be up to date?

I didn't at the time but now I have. There were 14 occurrences of now broken links, all of which are fixed now.

> 2. How do we deal with external links? It could be e.g. other repos (say, Axolotl, Hermes skill, etc.) but could also concern HF repos (e.g. links from PEFT or Transformers issues).

I've added a _redirects.yml with the most common redirects I found (mostly from transformers). I also checked axolotl, diffusers and unsloth - the latter was not easy to analyze systematically as I couldn't find the docs as plain text, so I resorted to delegating to an agent which didn't find references to the PEFT docs.

The Hermes PEFT skill (https://github.com/NousResearch/hermes-agent/tree/main/optional-skills/mlops/peft) doesn't seem to link to changed pages in the docs.
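As a rough illustration of the redirect idea (not the actual contents of `_redirects.yml`; every path below is made up), a redirect table can be sanity-checked by resolving old links and confirming that the target page still exists:

```python
# Hypothetical old -> new doc paths; NOT the real entries from _redirects.yml.
REDIRECTS = {
    "old_guides/lora": "package_reference/lora",
    "old_guides/prompt_tuning": "package_reference/prompt_tuning",
}

# Hypothetical set of pages that exist after the restructure.
EXISTING_PAGES = {
    "package_reference/lora",
    "package_reference/prompt_tuning",
    "quicktour",
}


def resolve(path, redirects=REDIRECTS):
    """Follow redirects until the path has no further redirect entry."""
    seen = set()
    while path in redirects:
        if path in seen:
            raise ValueError(f"redirect loop at {path!r}")
        seen.add(path)
        path = redirects[path]
    return path


def broken_links(links, existing=EXISTING_PAGES):
    """Return the links that do not end up on an existing page."""
    return [link for link in links if resolve(link) not in existing]


print(broken_links(["quicktour", "old_guides/lora", "old_guides/gone"]))
```

Links that resolve to an existing page are fine; anything left over would need either a new redirect entry or a fix at the source.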

githubnemo · 2026-06-05T18:12:50Z

This also requires merging of https://huggingface.co/datasets/huggingface/documentation-images/discussions/625.

BenjaminBossan · 2026-06-08T08:59:12Z

> This also requires merging of https://huggingface.co/datasets/huggingface/documentation-images/discussions/625.

Done

PR #3300 drafts the idea of embedding the method comparison results into the respective method pages. This calls for a lighter version of the existing space to limit the needed space. This is what `app_embed.py` is. Most of the common processing has moved to the existing and aptly named `processing.py`. I think that this is better than having a layout switch in `app.py` as these apps are meant to be as flat as can be to be readable and maintainable.

githubnemo · 2026-06-08T12:57:57Z

I think this is now ready for review. Sorry about the huge PR but dissolving the guides into the individual method pages made a relatively big splash in terms of changes, even though the individual changes are quite small.

@stevhliu it would be super cool if you could take a look as well :)

When reviewing the rendered doc on moon-ci-docs I noticed that the new images are rendered with borders (esp. visible in the quicktour) and the ToC indentation for LoRA variants is broken but I have no clue how to fix this. @stevhliu do you have an idea?

BenjaminBossan

Thanks a LOT for working on overhauling the PEFT docs. They always felt lacking and suboptimally structured to me, so I'm very happy to see improvements there.

For this review, I focused on the general sections but haven't reviewed the entries for the individual PEFT methods. This was in order to break down the review in smaller parts, as I'm not going to finish it today. It may also help avoid duplicate effort between me and Steven.

As a more general comment, I saw that some added parts contain manual line breaks, e.g. in overview.md. I would suggest removing those completely.

I like the idea of including a benchmark overview for each PEFT method. Now that we have image generation too, it would be great to add an option to toggle the benchmark, but let's leave that to a future PR. I noticed, however, that not each PEFT method includes the benchmark, e.g. HRA is missing it. Also, some methods like HiRA have the graph but no corresponding data points, but maybe its result was added after the space was deployed?

I also wonder if we should not fully remove the legend, as the resulting graph can become quite cramped:

There is also a bit of an inconsistency about the legend, e.g. for Lily it only labels the line but not the points. I think it should be removed for simplicity.

BenjaminBossan · 2026-06-08T14:58:43Z

+  <div class="flex flex-col basis-1/4">
+    There are numerous methods to "adapt" existing models, often extensively integrating into the model. PEFT can be thought of as a framework for arbitrary methods of model adaption (modifying weights, wrapping layers, manipulating KV-caches, ...) while also serving as a reference implementation for many fine-tuning methods.
+  </div>
+  <div class="flex flex-col basis-3/4 pl-10 pr-10"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/adapter_installation.png" width="100%"></div>


This looks a bit odd (middle part) with dark theme:

BenjaminBossan · 2026-06-08T15:10:53Z


+## Multiple adapters
+
+PEFT supports installing multiple adapters (of the same kind, in this document this would be LoRA) on top of a base model. When you call `get_peft_model` there is only one adapter named `"default"` but you can add as many additional adapters by calling `peft_model.add_adapter(adapter_name=...)`.


Suggested change

PEFT supports installing multiple adapters (of the same kind, in this document this would be LoRA) on top of a base model. When you call `get_peft_model` there is only one adapter named `"default"` but you can add as many additional adapters by calling `peft_model.add_adapter(adapter_name=...)`.

PEFT supports installing multiple adapters (of the same kind, in this document this would be LoRA) on top of a base model. When you call `get_peft_model` there is only one adapter named `"default"` but you can add as many additional adapters as you want by calling `peft_model.add_adapter(adapter_name=...)`.

BenjaminBossan · 2026-06-08T15:13:36Z

 model = AutoPeftModel.from_pretrained("smangrul/openai-whisper-large-v2-LORA-colab")
 ```

+## Multiple adapters


In the section above, the docs describe the AutoPeftModel API for loading trained adapters. I'm just wondering if we should not at the very least mention the PeftModel.from_pretrained(base_model, adapter_id) API as well.

BenjaminBossan · 2026-06-08T15:21:38Z

+
+## Choosing the right method
+
+Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length and is more prone to memory spikes than others.


I think as is, the last sentence doesn't quite make sense, even though it's clear what is meant. Here is a suggestion for a different wording.

Suggested change

Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length and is more prone to memory spikes than others.

Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length; some methods are more prone to memory spikes than others.
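To make the point about average versus maximum memory concrete, here is a tiny sketch. The two traces are invented numbers (GiB per training step), not measurements from the comparison suite:

```python
from statistics import mean

# Invented per-step accelerator memory readings in GiB, for illustration only.
steady_method = [10.0, 10.5, 9.5, 10.0]
spiky_method = [8.0, 8.0, 8.0, 16.0]

for name, trace in [("steady", steady_method), ("spiky", spiky_method)]:
    print(f"{name}: mean={mean(trace):.1f} GiB, max={max(trace):.1f} GiB")
```

Both traces have the same average, but only the maximum tells you whether the run fits on a given device.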

BenjaminBossan · 2026-06-08T15:22:42Z

+
+Especially when targeting large layers like language modeling heads or embedding layers to fine-tune specific tokens it might make sense to look into [using trainable tokens](troubleshooting#using-trainable-tokens).
+
+## Chunked NLL loss


I'd put this section last, I think the other ones below are more generally applicable.

BenjaminBossan · 2026-06-08T15:28:00Z

+
+## Quantization
+
+Quantization is one of the best ways to reduce memory consumption *of the base model* and will, depending on the employed quantization, also reduce activation memory. Since the PEFT methods will only take up a small portion of the total number of parameters, PEFT defaults to use a higher precision than the base model. This can also have the effect that adapters can mitigate some of the quality loss incured by quantization methods. Read the [PEFT quantization guide](quantization).


Suggested change

Quantization is one of the best ways to reduce memory consumption *of the base model* and will, depending on the employed quantization, also reduce activation memory. Since the PEFT methods will only take up a small portion of the total number of parameters, PEFT defaults to use a higher precision than the base model. This can also have the effect that adapters can mitigate some of the quality loss incured by quantization methods. Read the [PEFT quantization guide](quantization).

Quantization is one of the best ways to reduce memory consumption *of the base model* and will, depending on the employed quantization, also reduce activation memory. Since the PEFT methods will only take up a small portion of the total number of parameters, PEFT defaults to use a higher precision than the base model. This can also have the effect that adapters can mitigate some of the quality loss incurred by quantization methods. Read the [PEFT quantization guide](quantization).
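For a sense of scale, here is a back-of-the-envelope sketch of the base-model weight memory at different precisions. The 7B parameter count and the bit widths are illustrative assumptions, and real quantization formats add some overhead (scales, zero points) that this ignores:

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Memory for storing the weights only, in GiB (no activations, no optimizer state)."""
    return num_params * bits_per_param / 8 / 2**30


num_params = 7_000_000_000  # assumed model size, for illustration only

for name, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gib(num_params, bits):6.2f} GiB")
```

The adapter weights are small by comparison, which is why keeping them in a higher precision costs little.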

BenjaminBossan · 2026-06-08T15:29:24Z

+
+## Gradient Checkpointing
+
+You can trade memory with computation by only saving every nth gradient between layers and computing the rest on the fly. Check out the [gradient checkpointing](https://huggingface.co/docs/transformers/grad_checkpointing) documentation of Transformers to learn more.


Maybe worth mentioning that if not using Transformers or Diffusers, users may have to implement their own GC logic.
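For readers weighing that trade-off: checkpointing keeps only some of the intermediate activations and recomputes the rest during the backward pass. The sketch below is a deliberately simplified counting model, assuming every layer produces an activation of the same size and one segment is recomputed at a time; it is not how any particular library accounts for memory:

```python
import math


def peak_stored_activations(num_layers: int, every_n: int) -> int:
    """Rough upper bound on simultaneously stored layer activations.

    Counts the checkpoints kept for the backward pass plus the activations
    of the single segment that is being recomputed.
    """
    checkpoints = math.ceil(num_layers / every_n)
    return checkpoints + every_n


num_layers = 32  # assumed depth, for illustration only
print("no checkpointing:", num_layers)
for every_n in (2, 4, 8, 16):
    print(f"every {every_n:>2} layers:", peak_stored_activations(num_layers, every_n))
```

In this model the count is smallest when `every_n` is close to the square root of the layer count, at the price of roughly one extra forward pass of compute.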

BenjaminBossan · 2026-06-08T15:32:51Z

+Giving general advice for training large models is hard but for generative
+models, especially language models, you can follow these steps:
+
+1. use prompting (few-shot examples in the prompt) to see if the model is


Suggested change

1. use prompting (few-shot examples in the prompt) to see if the model is

1. use prompting (e.g. few-shot examples in the prompt) to see if the model is

BenjaminBossan · 2026-06-08T15:39:00Z

+   fine-tuning step is potentially unlearning past knowledege.
+
+The [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) aims to give a rough overview of (most) implemented methods on selected benchmarks and models.
+


It could also be useful to mention some criteria here that may guide you in choosing the appropriate PEFT method:

* quantization: not all methods support quantized base models
* feature set: not all features are supported for all methods (e.g. multiple adapters, mixed adapter inference)
* layer types: linear layers are generally always supported, but not all methods support embedding (important for expanding vocab) or conv (important for some image models)
* inference runtime: PEFT methods generally add runtime overhead but some of that can be mitigated (e.g. some methods allow merging, removing the overhead)

BenjaminBossan · 2026-06-08T15:40:19Z

+
+## Layer Tuning
+
+Layer Tuning categorizes methods that target specific layers of a model such as [LayerNorm Tuning](../package_reference/layernorm_tuning)


"target specific layers" doesn't make it quite clear that it means that existing parameters of the base model are made trainable, since you could say that LoRA also targets specific layers. I would state that explicitly.

stevhliu

nice work! i focused mainly on memory_efficient-training and memory/overview in this pass

> the ToC indentation for LoRA variants is broken

i think the doc-builder only supports 3 levels of nesting so maybe flatten the variants section?

> new images are rendered with borders

the doc-builder automatically renders it with a border i believe. i would open an issue on the doc-builder repo for this :)

stevhliu · 2026-06-09T16:47:44Z

+
+Low-Rank Adaptation ([LoRA](https://huggingface.co/papers/2106.09685)) is a PEFT method that decomposes a large matrix into two smaller low-rank matrices. This drastically reduces the number of parameters that need to be fine-tuned.

 The abstract from the paper is:


i think this is the wrong abstract

stevhliu · 2026-06-09T16:48:09Z

+In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. However, for simplicity and further parameter efficiency, LoRA is typically only applied to the attention blocks in Transformer models - it may be worth targeting other layers as well. The resulting number of trainable parameters in a LoRA model depends on the size of the update matrices, which is determined mainly by the rank `r` and the shape of the original weight matrix.

-## Utility
+You can initialize the low-rank matrices with different use-cases in mind - task awareness (CoRDA, EVA), faster convergence (PiSSA), mitigating quantizations (LoftQ) - just to name a few use-cases. Read about the different initializations [below](#Initialization). The default initialization is for LoRA to be a no-op, to gradually learn new behavior without interfering much with the existing model.


Suggested change

You can initialize the low-rank matrices with different use-cases in mind - task awareness (CoRDA, EVA), faster convergence (PiSSA), mitigating quantizations (LoftQ) - just to name a few use-cases. Read about the different initializations [below](#Initialization). The default initialization is for LoRA to be a no-op, to gradually learn new behavior without interfering much with the existing model.

You can initialize the low-rank matrices with different use-cases in mind - task awareness (CoRDA, EVA), faster convergence (PiSSA), mitigating quantizations (LoftQ) - just to name a few use-cases. Read about the different initializations [below](#initialization). The default initialization is for LoRA to be a no-op, to gradually learn new behavior without interfering much with the existing model.
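As a quick check on the parameter-count statement in the quoted LoRA paragraph, here is a small sketch. It assumes a single linear layer with a frozen `d_out x d_in` weight and the usual pair of update matrices; the 4096 dimension is an illustrative choice:

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


d_in = d_out = 4096  # assumed projection size, for illustration only
full = d_in * d_out

for r in (4, 8, 16, 64):
    n = lora_trainable_params(d_in, d_out, r)
    print(f"r={r:>2}: {n:>7} trainable ({100 * n / full:.2f}% of the full matrix)")
```

The count grows linearly with `r`, while the full matrix grows with the product of its dimensions.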

stevhliu · 2026-06-09T17:19:33Z

+Giving general advice for training large models is hard but for generative
+models, especially language models, you can follow these steps:
+
+1. use prompting (few-shot examples in the prompt) to see if the model is


may be nice to reorder the categories to be more consistent

* the intro workflow lists prompting, layer tuning, adapters
* the page sections are ordered adapters, prompting, layer tuning
* the toctree orders them as layer tuning, soft prompting, adapters

it would also be good to pick and use the same terms in the sidebar and here (prompt-based methods vs soft prompting)

stevhliu · 2026-06-09T17:21:29Z

+
+## Layer Tuning
+
+Layer Tuning categorizes methods that target specific layers of a model such as [LayerNorm Tuning](../package_reference/layernorm_tuning)


this section feels a bit thin. you could maybe add something about what distinguishes it more from prompting that makes it more expressive. otherwise, it'd be harder to pick between the two of them

stevhliu · 2026-06-09T17:21:55Z

+   and [adapter methods](#adapter-methods). These methods are generally
+   more expressive than prompt-based methods and get closer to full-finetuning.
+3. Make sure to measure retention of already learnt knowledge since each
+   fine-tuning step is potentially unlearning past knowledege.


Suggested change

fine-tuning step is potentially unlearning past knowledege.

fine-tuning step is potentially unlearning past knowledge.

stevhliu · 2026-06-09T17:23:14Z

+
+# Parameter efficient fine-tuning methods
+
+Training a model parameter efficiently means to train as few parameters as possible to achieve comparable performance to training all parameters, i.e. full fine-tuning. There is, of course, no free lunch: by using fewer and therefore less expressive, parameters, it is not guaranteed that you will get the same performance! You may need to use a specific PEFT method to get optimal results for the model/task combination you want to train. But you will need less memory and possibly less compute during training and may gain features such as fast hot-swapping between trained expert models and less forgetting of previous knowledge compared to full fine-tuning.


would be good to add a link to hot-swapping to make it concrete

Suggested change

Training a model parameter efficiently means to train as few parameters as possible to achieve comparable performance to training all parameters, i.e. full fine-tuning. There is, of course, no free lunch: by using fewer and therefore less expressive, parameters, it is not guaranteed that you will get the same performance! You may need to use a specific PEFT method to get optimal results for the model/task combination you want to train. But you will need less memory and possibly less compute during training and may gain features such as fast hot-swapping between trained expert models and less forgetting of previous knowledge compared to full fine-tuning.

PEFT methods train as few parameters as possible while aiming for performance comparable to full fine-tuning. Fewer trainable parameters are less expressive, so the same performance isn't guaranteed. In exchange you use less memory, often less compute, and gain features like fast hot-swapping between expert adapters and less forgetting of prior knowledge.

stevhliu · 2026-06-09T17:32:06Z

+
+Using [`NLLLoss`](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) is very common when training language models (or classification tasks, for that matter) but it is usually computed in one go, meaning you will allocate a matrix of size `batch × sequence × vocabulary`. With particularly long sequences or vocabularies this can get expensive fast.
+
+When using [TRL] you can either use the [Liger kernel integration](https://huggingface.co/docs/trl/liger_kernel_integration) or use [Chunked NLLLoss](https://huggingface.co/docs/trl/v1.5.1/en/reducing_memory_usage#chunked-cross-entropy-for-reducing-peak-memory-usage). The latter will split the sequence in chunks of size 256 to keep the maximum memory consumption constant.


Suggested change

When using [TRL] you can either use the [Liger kernel integration](https://huggingface.co/docs/trl/liger_kernel_integration) or use [Chunked NLLLoss](https://huggingface.co/docs/trl/v1.5.1/en/reducing_memory_usage#chunked-cross-entropy-for-reducing-peak-memory-usage). The latter will split the sequence in chunks of size 256 to keep the maximum memory consumption constant.

When using [TRL](https://huggingface.co/docs/trl) you can either use the [Liger kernel integration](https://huggingface.co/docs/trl/liger_kernel_integration) or use [Chunked NLLLoss](https://huggingface.co/docs/trl/v1.5.1/en/reducing_memory_usage#chunked-cross-entropy-for-reducing-peak-memory-usage). The latter will split the sequence in chunks of size 256 to keep the maximum memory consumption constant.
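To illustrate why chunking does not change the loss value, here is a minimal pure-Python sketch. It is not TRL's implementation; `logits_for` is a hypothetical callback that stands in for running the language modeling head on one chunk of positions, and the chunk size of 256 is taken from the paragraph above:

```python
import math
import random


def nll_sum(logits, targets):
    """Summed negative log-likelihood for a list of logit rows and target ids."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # stable log-sum-exp
        total += log_z - row[target]
    return total


def chunked_nll_mean(logits_for, targets, chunk_size=256):
    """Mean NLL where logits are requested for only one chunk at a time."""
    total = 0.0
    for start in range(0, len(targets), chunk_size):
        stop = min(start + chunk_size, len(targets))
        total += nll_sum(logits_for(start, stop), targets[start:stop])
    return total / len(targets)


random.seed(0)
seq_len, vocab = 1000, 50  # toy sizes so the example runs quickly
all_logits = [[random.gauss(0, 1) for _ in range(vocab)] for _ in range(seq_len)]
targets = [random.randrange(vocab) for _ in range(seq_len)]

full = nll_sum(all_logits, targets) / seq_len
chunked = chunked_nll_mean(lambda a, b: all_logits[a:b], targets, chunk_size=256)
print(math.isclose(full, chunked, rel_tol=1e-9))
```

In this toy the logits are precomputed, so nothing is saved; the saving in a real setup comes from only ever materializing the logits of one chunk.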

stevhliu · 2026-06-09T17:32:48Z

+
+# Memory Efficient Training
+
+🤗 PEFT provides you with methods for parameter efficient fine-tuning but that doesn't mean that your training process is memory efficient. This guide is a collection of tips that you can use to improve memory efficiency of your training process. This guide is mostly an overview page that will link you to the respective other guides and offer some tips for specific situations.


Suggested change

🤗 PEFT provides you with methods for parameter efficient fine-tuning but that doesn't mean that your training process is memory efficient. This guide is a collection of tips that you can use to improve memory efficiency of your training process. This guide is mostly an overview page that will link you to the respective other guides and offer some tips for specific situations.

🤗 PEFT makes fine-tuning parameter efficient, but not automatically memory efficient. This overview collects tips for cutting training memory and links to the detailed guides.

stevhliu · 2026-06-09T17:33:10Z

+
+## Chunked NLL loss
+
+Using [`NLLLoss`](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) is very common when training language models (or classification tasks, for that matter) but it is usually computed in one go, meaning you will allocate a matrix of size `batch × sequence × vocabulary`. With particularly long sequences or vocabularies this can get expensive fast.


Suggested change

Using [`NLLLoss`](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) is very common when training language models (or classification tasks, for that matter) but it is usually computed in one go, meaning you will allocate a matrix of size `batch × sequence × vocabulary`. With particularly long sequences or vocabularies this can get expensive fast.

Using [`NLLLoss`](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) is very common when training language models (or classification tasks). You allocate a matrix of size `batch × sequence × vocabulary`. With particularly long sequences or vocabularies this can get expensive fast.
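A rough calculation shows how quickly the `batch × sequence × vocabulary` matrix grows. The shapes below are assumed for illustration, with 4 bytes per value (fp32 logits):

```python
def logits_bytes(batch: int, seq_len: int, vocab: int, bytes_per_value: int = 4) -> int:
    """Size of a dense batch x sequence x vocabulary logit matrix in bytes."""
    return batch * seq_len * vocab * bytes_per_value


batch, seq_len, vocab = 4, 8192, 128_000  # assumed shapes, for illustration only

full_bytes = logits_bytes(batch, seq_len, vocab)
chunk_bytes = logits_bytes(batch, 256, vocab)  # one chunk of 256 positions

print(f"full sequence: {full_bytes / 2**30:.2f} GiB")
print(f"one 256-token chunk: {chunk_bytes / 2**30:.2f} GiB")
```

The chunked size no longer depends on the sequence length, only on batch size, chunk size and vocabulary.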

stevhliu · 2026-06-09T17:35:06Z

+
+Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length and is more prone to memory spikes than others.
+
+Especially when targeting large layers like language modeling heads or embedding layers to fine-tune specific tokens it might make sense to look into [using trainable tokens](troubleshooting#using-trainable-tokens).


Suggested change

Especially when targeting large layers like language modeling heads or embedding layers to fine-tune specific tokens it might make sense to look into [using trainable tokens](troubleshooting#using-trainable-tokens).

Consider [using trainable tokens](troubleshooting#using-trainable-tokens) when targeting large layers like language modeling heads or embedding layers to fine-tune specific tokens.

nemo added 12 commits May 29, 2026 14:36

Starting with prompt tuning methods

cce64a5

Moved prompt tuning methods to respective sections

485fcdb

Move IA3

02a13f9

Moving LoRA

0dc79b0

Moving from conceptual guides

a552f8e

More moving adapter stuff

2cb339f

Consolidation

08e3408

Some work on the quicktour + fixes

610c1da

Benchmark results for some adapters

6bf3be2

Integrade config guide instead

0087b91

Moved *OFT to subsections + docstring update

5805c57

add training efficiency guide

65f2f2b

nemo added 2 commits June 4, 2026 13:10

Merge remote-tracking branch 'hf/main' into feature/doc-restructuring

b883ca8

githubnemo mentioned this pull request Jun 5, 2026

Separate, embeddable method comparison space #3303

Merged

nemo added 5 commits June 5, 2026 16:38

Minor fixes + broken links

fe7dcee

Use new space URL

8bc7a20

Sort methods alphabetically in TOC

fe18c6d

Missing benchmark widgets

492a935

Add redirects for common referenced articles

d07dbdb

nemo added 2 commits June 5, 2026 20:16

Change image links to dataset links

631cd29

Chunked NLL image to dataset as well

2c82047

Fix ruff issues

cdbc670

fix typo

56f213b

Address some left-over issues from the notion doc

0d7a7a6

githubnemo marked this pull request as ready for review June 8, 2026 12:54

githubnemo requested review from BenjaminBossan and stevhliu June 8, 2026 12:54

BenjaminBossan requested changes Jun 8, 2026


BenjaminBossan mentioned this pull request Jun 9, 2026

docs: fix typos in OFT conceptual guide #3311

Open

4 tasks

stevhliu reviewed Jun 9, 2026



